DEX: Data Channel Extension for Efficient CNN Inference on Tiny AI Accelerators
Tiny machine learning (TinyML) aims to run ML models on small devices and is increasingly favored for its enhanced privacy, reduced latency, and low cost. Recently, the advent of tiny AI accelerators has revolutionized the TinyML field by significantly enhancing hardware processing power. These accelerators, equipped with multiple parallel processors and dedicated per-processor memory instances, offer substantial performance improvements over traditional microcontroller units (MCUs).
RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning
Khalil, Khurram, Khaliq, Muhammad Mahad, Hoque, Khaza Anuarul
Abstract--The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2x fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8x improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows. The recent advent of Large Language Models (LLMs) with hundreds of billions of parameters has had a transformative impact on computing, but has also introduced unprecedented computational demands [1].
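The sequential decision-making framing described above can be illustrated with a toy sketch: an epsilon-greedy agent learns which fault sites degrade output most, so that a small test suite concentrates on worst-case faults. The fault sites, the impact model, and all numbers below are illustrative assumptions, not RIFT's actual design.

```python
import random

def fault_impact(site):
    # Stand-in for injecting a fault at `site` during a workload run and
    # measuring output degradation; here a fixed synthetic profile.
    return {0: 0.1, 1: 0.9, 2: 0.3, 3: 0.8}[site]

def build_test_suite(n_sites=4, episodes=200, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [0.0] * n_sites          # estimated impact per fault site
    counts = [0] * n_sites
    for _ in range(episodes):
        if rng.random() < eps:   # explore a random site
            site = rng.randrange(n_sites)
        else:                    # exploit the current best estimate
            site = max(range(n_sites), key=lambda s: q[s])
        r = fault_impact(site)
        counts[site] += 1
        q[site] += (r - q[site]) / counts[site]  # incremental mean
    # keep only high-impact sites -> a minimal, high-impact test suite
    return sorted(s for s in range(n_sites) if q[s] > 0.5)

print(build_test_suite())  # only the high-impact fault sites survive
```

In this toy setting the agent converges on the two sites with impact above the threshold while spending most of its injection budget on them, which is the intuition behind reducing test vector volume relative to random injection.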
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
Zhou, Zhongchun, Lai, Chengtao, Gu, Yuhang, Zhang, Wei
Abstract--The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and their asynchronous management, we investigate the opposite end of the design spectrum: a multi-core AI accelerator equipped with a shared system-level cache and application-aware management policies, which keeps the programming effort modest. Our approach exploits dataflow information available in the software stack to guide cache replacement (including dead-block prediction), in concert with bypass decisions and mechanisms that alleviate cache thrashing. We assess the proposal using a cycle-accurate simulator and observe substantial performance gains (up to 1.80x speedup) compared with conventional cache architectures. In addition, we build and validate an analytical model that accounts for the actual overlapping behaviors to extend the measured results of our policies to larger-scale real-world workloads. Experimental results show that, when functioning together, our bypassing and thrashing-mitigation strategies can handle scenarios both with and without inter-core data sharing and achieve remarkable speedups. Finally, we implement the design in RTL; its area is 0.064 mm². Our findings explore the potential of the shared cache design to assist the development of future AI accelerator systems.
A preliminary version of this paper appeared in the proceedings of ICS 2024. Z. Zhou and C. Lai contributed equally to this work. Z. Zhou and C. Lai are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: zzhouch@connect.ust.hk). Y. Gu is with the School of Electronic Science and Engineering, Southeast University, Nanjing, Jiangsu, China. W. Zhang (corresponding author) is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: eeweiz@ust.hk).
With the advent of the artificial intelligence (AI) era, the demand for AI-tailored hardware has surged across various environments, from data centers to embedded systems. These accelerators span a broad spectrum, from power-efficient devices to those designed for high computational throughput [34]. AI accelerators, compared with Graphics Processing Units (GPUs), can be optimized for AI applications and tailored for specific scenarios, such as pre-defined neural network (NN) computation graphs, operator types, certain data precisions, and given power budgets. Since they are often used in scenarios where the execution graph is known during compilation, they typically employ software-controlled scratchpad memories (SPMs) as the on-chip storage.
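The interplay of dead-block prediction and bypassing described in the abstract can be sketched in a few lines: software-supplied hints mark a line dead after its last use (freeing capacity early) and route streaming tensors around the cache so they cannot thrash resident working sets. This is a minimal illustrative model, not the paper's RTL design; all names and sizes are assumptions.

```python
from collections import OrderedDict

class HintedCache:
    """LRU cache with dataflow hints: dead-after-access and bypass."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> data, kept in LRU order
        self.hits = self.misses = 0

    def access(self, addr, data, dead_after=False, streaming=False):
        if streaming:                # bypass: never allocate a line
            self.misses += 1
            return data
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)         # refresh LRU position
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the LRU line
            self.lines[addr] = data
        if dead_after:               # dead-block hint: free the line now
            self.lines.pop(addr, None)
        return data

cache = HintedCache(capacity=2)
cache.access("A", 1)                     # miss, line allocated
cache.access("A", 1)                     # hit
cache.access("W", 2, streaming=True)     # bypassed; "A" stays resident
cache.access("A", 1, dead_after=True)    # hit, then evicted early
print(cache.hits, cache.misses)
```

The streaming access counts as a miss but does not displace the resident tensor, and the dead-block hint vacates a line well before LRU would have, which is the core of the application-aware policy.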
- North America > United States (0.04)
- Europe > France (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Improving AI Efficiency in Data Centres by Power Dynamic Response
Marinoni, Andrea, Shivareddy, Sai, Lio', Pietro, Lin, Weisi, Cambria, Erik, Grey, Clare
The steady growth of artificial intelligence (AI) has accelerated in recent years, facilitated by the development of sophisticated models such as large language models and foundation models. Ensuring robust and reliable power infrastructures is fundamental to taking advantage of the full potential of AI. However, AI data centres are extremely hungry for power, putting the problem of their power management in the spotlight, especially with respect to their impact on the environment and sustainable development. In this work, we investigate the capacity and limits of solutions based on an innovative approach to the power management of AI data centres, i.e., making part of the input power as dynamic as the power used for data-computing functions. The performance of passive and active devices is quantified and compared in terms of computational gain, energy efficiency, reduction of capital expenditure, and management costs by analysing power trends from multiple data platforms worldwide. This strategy, which represents a paradigm shift in AI data centre power management, has the potential to strongly improve the sustainability of AI hyperscalers, improving their environmental, financial, and societal impact.
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- Research Report > Promising Solution (0.48)
- Overview > Innovation (0.48)
- Information Technology > Information Management (1.00)
- Information Technology > Cloud Computing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
This paper presents the first comprehensive cross-architectural performance analysis of contemporary AI accelerators designed for LLM inference, introducing a novel workload-centric evaluation methodology that quantifies architectural fitness across operational regimes. We provide the first systematic comparison of memory hierarchies, compute architectures, and interconnect strategies across the full spectrum of commercial accelerators, from GPU-based designs to specialized wafer-scale engines. Our analysis reveals that no single architecture dominates across all workload categories, with performance variations of up to 3.7x between architectures depending on batch size and sequence length. We quantitatively evaluate four primary scaling strategies for trillion-parameter models, demonstrating that expert parallelism delivers the best parameter-to-compute ratio (8.4x) but introduces 2.1x latency variance compared to tensor parallelism. This work provides system designers with actionable insights for accelerator selection based on workload characteristics, while identifying key architectural gaps in current designs that will shape future hardware development.
- Research Report (0.82)
- Overview (0.68)
High-Throughput LLM Inference on Heterogeneous Clusters
Xiong, Yi, Huang, Jinqi, Huang, Wenjie, Yu, Xuebing, Li, Entong, Ning, Zhixiong, Zhou, Jinhua, Zeng, Li, Chen, Xin
Nowadays, many companies possess various types of AI accelerators, forming heterogeneous clusters. Efficiently leveraging these clusters for high-throughput large language model (LLM) inference services can significantly reduce costs and expedite task processing. However, LLM inference on heterogeneous clusters presents two main challenges. First, different deployment configurations can result in vastly different performance. The number of possible configurations is large, and evaluating the effectiveness of a specific setup is complex, so finding an optimal configuration is not an easy task. Second, LLM inference instances within a heterogeneous cluster possess varying processing capacities, leading to different processing speeds for handling inference requests. Evaluating these capacities and designing a request scheduling algorithm that fully maximizes the potential of each instance is challenging. In this paper, we propose a high-throughput inference service system for heterogeneous clusters. First, the deployment configuration is optimized by modeling the resource amount and expected throughput and then applying exhaustive search. Second, a novel mechanism is proposed to schedule requests among instances, which fully considers the different processing capabilities of the various instances. Extensive experiments show that the proposed scheduler improves throughput by 122.5% and 33.6% on two heterogeneous clusters, respectively.
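Capacity-aware request scheduling of the kind described above can be sketched as earliest-estimated-completion dispatch: each instance carries a measured throughput, and every incoming request goes to the instance whose queue will drain soonest. The instance names, capacities, and cost model below are made up for illustration and are not the paper's actual scheduler.

```python
import heapq

def schedule(requests, capacities):
    # Min-heap of (estimated_free_time, instance); faster instances
    # accumulate less work per request and therefore drain sooner.
    heap = [(0.0, name) for name in sorted(capacities)]
    heapq.heapify(heap)
    assignment = {}
    for req in requests:
        free_at, name = heapq.heappop(heap)
        assignment[req] = name
        # One request adds 1/capacity seconds of work to this instance.
        heapq.heappush(heap, (free_at + 1.0 / capacities[name], name))
    return assignment

caps = {"a100": 4.0, "v100": 2.0, "t4": 1.0}   # relative throughputs (assumed)
plan = schedule([f"req{i}" for i in range(7)], caps)
loads = {n: sum(1 for v in plan.values() if v == n) for n in caps}
print(loads)  # faster instances receive proportionally more requests
```

With these toy capacities the 4:2:1 throughput ratio yields a 4:2:1 request split, so no instance becomes the bottleneck; a real system would additionally refresh the capacity estimates online as request lengths vary.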
Runtime Detection of Adversarial Attacks in AI Accelerators Using Performance Counters
Rahaman, Habibur, Chatterjee, Atri, Bhunia, Swarup
Rapid adoption of AI technologies raises several major security concerns, including the risks of adversarial perturbations, which threaten the confidentiality and integrity of AI applications. Protecting AI hardware from misuse and diverse security threats is a challenging task. To address this challenge, we propose SAMURAI, a novel framework for safeguarding against malicious usage of AI hardware and its resilience to attacks. SAMURAI introduces an AI Performance Counter (APC) for tracking dynamic behavior of an AI model coupled with an on-chip Machine Learning (ML) analysis engine, known as TANTO (Trained Anomaly Inspection Through Trace Observation). APC records the runtime profile of the low-level hardware events of different AI operations. Subsequently, the summary information recorded by the APC is processed by TANTO to efficiently identify potential security breaches and ensure secure, responsible use of AI. SAMURAI enables real-time detection of security threats and misuse without relying on traditional software-based solutions that require model integration. Experimental results demonstrate that SAMURAI achieves up to 97% accuracy in detecting adversarial attacks with moderate overhead on various AI models, significantly outperforming conventional software-based approaches. It enhances security and regulatory compliance, providing a comprehensive solution for safeguarding AI against emergent threats.
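The counter-profiling idea behind APC and TANTO can be illustrated with a simple statistical sketch: benign inference runs establish a per-counter profile, and a run whose counters deviate strongly is flagged. The counter names, values, and z-score threshold are assumptions for illustration, not the paper's actual APC/TANTO design.

```python
import statistics

def build_profile(benign_traces):
    # Mean and standard deviation per hardware counter across benign runs.
    keys = benign_traces[0].keys()
    return {k: (statistics.mean(t[k] for t in benign_traces),
                statistics.stdev(t[k] for t in benign_traces))
            for k in keys}

def is_anomalous(trace, profile, z_threshold=3.0):
    # Flag the run if any counter lies far outside its benign distribution.
    for k, (mu, sigma) in profile.items():
        if sigma > 0 and abs(trace[k] - mu) / sigma > z_threshold:
            return True
    return False

# Synthetic counter traces from five benign inference runs.
benign = [{"mac_ops": 1000 + d, "dram_reads": 200 + d} for d in (-5, 0, 5, 3, -3)]
profile = build_profile(benign)

normal = {"mac_ops": 1002, "dram_reads": 201}
attack = {"mac_ops": 1400, "dram_reads": 520}   # perturbed input inflates work
print(is_anomalous(normal, profile), is_anomalous(attack, profile))
```

A trained on-chip classifier, as in the paper, can capture correlations between counters that this per-counter z-score misses, but the sketch conveys why low-level event profiles expose adversarial activity without touching the model's software stack.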
Exploring the Potential of Wireless-enabled Multi-Chip AI Accelerators
Irabor, Emmanuel, Musavi, Mariam, Das, Abhijit, Abadal, Sergi
The insatiable appetite of Artificial Intelligence (AI) workloads for computing power is pushing the industry to develop faster and more efficient accelerators. The rigidity of custom hardware, however, conflicts with the need for scalable and versatile architectures capable of catering to the evolving and heterogeneous pool of Machine Learning (ML) models in the literature. In this context, multi-chiplet architectures assembling multiple (perhaps heterogeneous) accelerators are an appealing option that is unfortunately hindered by still rigid and inefficient chip-to-chip interconnects. In this paper, we explore the potential of wireless technology as a complement to existing wired interconnects in this multi-chiplet approach. Using a state-of-the-art evaluation framework, we show that wireless interconnects can lead to speedups of 10% on average and up to 20%. We also highlight the importance of load balancing between the wired and wireless interconnects, which will be further explored in future work.